Syntagmatic and Paradigmatic Representations of Term Variation

Author

  • Christian Jacquemin
Abstract

A two-tier model for the description of morphological, syntactic and semantic variations of multi-word terms is presented. It is applied to term normalization of French and English corpora in the medical and agricultural domains. Five different sources of morphological and semantic knowledge are exploited (MULTEXT, CELEX, AGROVOC, WordNet 1.6, and Microsoft Word97 thesaurus).

1 Introduction

In the classical approach to text retrieval, terms are assigned to queries and documents. The terms are generated by a process called automatic indexing. Then, given a query, the similarity between the query and the documents is computed, and a ranked list of documents is produced as the output of the system for information access (Salton and McGill, 1983).

The similarity between queries and documents depends on the terms they have in common. The same concept can be formulated in many different ways, known as variants, which should be conflated in order to avoid missing relevant documents. For this purpose, this paper proposes a novel model of term variation that integrates linguistic knowledge and performs accurate term normalization. It relies on previous or ongoing linguistic studies on this topic (Sparck Jones and Tait, 1984; Jacquemin et al., 1997; Hamon et al., 1998).

Terms are described in a two-tier framework composed of a paradigmatic level and a syntagmatic level that account for the three linguistic dimensions of term variability (morphology, syntax, and semantics). Term variants are extracted from tagged corpora through FASTR¹, a unification-based transformational parser described in (Jacquemin et al., 1997).

¹ FASTR can be downloaded from www.limsi.fr/Individu/jacquemi/FASTR.

Four experiments are performed on the French and the English languages, and a measure of precision is provided for each of them. Two experiments are made on a French corpus [AGRIC] composed of 1.2 × 10^6 words of scientific abstracts in the agricultural domain, and two on an English corpus [MEDIC] composed of 1.3 × 10^6 words of scientific abstracts in the medical domain.

The two experiments in the French language are [AGRIC] + Word97 and [AGRIC] + AGROVOC. In the former, synonymy links are extracted from the Microsoft Word97 thesaurus; in the latter, semantic classes are extracted from the AGROVOC thesaurus, a thesaurus specialized in the agricultural domain (AGROVOC, 1995). In both experiments, morphological data are produced by a stemming algorithm applied to the MULTEXT lexical database (MULTEXT, 1998).

The two experiments on the English language are [MEDIC] + WordNet 1.6 and [MEDIC] + Word97; they correspond to two different sources of semantic knowledge. In both cases, the morphological data are extracted from CELEX (CELEX, 1998).

2 Term Variation: Representation and Exploitation

Terms and variations are represented in two parallel frameworks illustrated by Figure 1. While a term is described by a unique pair composed of a structure (at the syntagmatic level) and a set of lexical items (at the paradigmatic level), a variation is represented by a pair of such pairs: one of them is the source term (or normalized term) and the other one is the target term (or variant).

The syntagmatic description of a term is a context-free rule; it is complemented with lexical information embedded in a feature structure denoted by constraints between paths and values. For instance, the term speed measurement is represented by:

$$
\left\{
\begin{array}{ll}
\text{Syntagm:} & \{\, N_0 \rightarrow N_2\ N_1 \,\} \\
\text{Paradigm:} & \{\, \langle N_1\ \text{lemma} \rangle = \textit{measurement},\ \ \langle N_2\ \text{lemma} \rangle = \textit{speed} \,\}
\end{array}
\right\}
\qquad (1)
$$

This term is a noun phrase composed of a head noun N1 and a modifier N2; the lemmas are given by the constraints at the paradigmatic level. This framework is similar to the unification-based representation of context-free grammars of (Shieber, 1992).
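The two-tier description of a term, and of a variation as a pair of terms, can be sketched as a small data structure. The following Python fragment is only an illustration under stated assumptions: the class names `Term` and `Variation`, the `(node, feature)` encoding of paths, and the target term "measurement of speed" are hypothetical and do not come from the paper or from FASTR.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Term:
    """A term as a pair: a syntagmatic rule plus paradigmatic constraints."""
    lhs: str                             # left-hand side of the context-free rule
    rhs: Tuple[str, ...]                 # right-hand side, in surface order
    paradigm: Dict[Tuple[str, str], str]  # (node, feature) path -> value

    def lemmas(self) -> List[str]:
        """Return the lemmas of the right-hand-side nodes in surface order."""
        return [self.paradigm[(node, "lemma")] for node in self.rhs]


@dataclass
class Variation:
    """A variation as a pair of terms: normalized source and variant target."""
    source: Term
    target: Term


# The term "speed measurement": N1 is the head noun, N2 the modifier.
speed_measurement = Term(
    lhs="N0",
    rhs=("N2", "N1"),
    paradigm={("N1", "lemma"): "measurement", ("N2", "lemma"): "speed"},
)

# Hypothetical variant, invented here for illustration only.
measurement_of_speed = Term(
    lhs="N0",
    rhs=("N1", "Prep", "N2"),
    paradigm={
        ("N1", "lemma"): "measurement",
        ("Prep", "lemma"): "of",
        ("N2", "lemma"): "speed",
    },
)

variation = Variation(source=speed_measurement, target=measurement_of_speed)
print(variation.source.lemmas(), "->", variation.target.lemmas())
```

Keeping the rule and the lexical constraints in separate fields mirrors the split between the syntagmatic and paradigmatic levels: the same rule shape can be reused with different lemmas, and the same lemmas can appear under different rules.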

Similar resources

Correspondence of Syntagmatic and Paradigmatic Axes Relations, and Their Transformation in Relation to the Communicative Role of Shahnameh Illustration in Shiraz School of Miniature

When treated like texts with their own visual language, illustrations from the Shiraz School of miniature are a mixture of the syntagmatic and paradigmatic relations of signs. Syntagmatic relations reveal the different ways the elements of a text are connected, while paradigmatic relations identify the sets of signifiers that signify the content of the text, dealing with intratextual and intert...

Contrasting Syntagmatic and Paradigmatic Relations: Insights from Distributional Semantic Models

This paper presents a large-scale evaluation of bag-of-words distributional models on two datasets from priming experiments involving syntagmatic and paradigmatic relations. We interpret the variation in performance achieved by different settings of the model parameters as an indication of which aspects of distributional patterns characterize these types of relations. Contrary to what has been ...

Syntagmatic Processes

Traditionally, syntagmatic processes refer to the influence of “horizontal” elements on a word or phrase, in contradistinction to paradigmatic processes, which refer to “vertical” or alternative substitutions in a phrase. The term had significant currency in early and mid-twentieth century linguistics from Saussure on, and helped to define the formal study of syntax as widely practiced today. T...

Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations

Vector space representation of words has been widely used to capture fine-grained linguistic regularities, and proven to be successful in various natural language processing tasks in recent years. However, existing models for learning word representations focus on either syntagmatic or paradigmatic relations alone. In this paper, we argue that it is beneficial to jointly modeling both relations...

Learning Syntactic Categories Using Paradigmatic Representations of Word Context

We investigate paradigmatic representations of word context in the domain of unsupervised syntactic category acquisition. Paradigmatic representations of word context are based on potential substitutes of a word in contrast to syntagmatic representations based on properties of neighboring words. We compare a bigram based baseline model with several paradigmatic models and demonstrate significan...

A Vector Model for Syntagmatic and Paradigmatic Relatedness

This paper introduces context digests, high-dimensional real-valued representations for the typical left and right contexts of a word. Initial entries for the context digests are formed from the word’s close left and right neighbors. A singular value decomposition reduces the dimensionality of the space to enable subsequent efficient processing. In contrast to similar techniques, no preprocesso...

Publication date: 1999